Initial Exploration of the P/NBD Model
In this workbook we introduce the various BTYD models, starting with a discussion of the underlying theory.
1 Background Theory
Before we start working on fitting and using the various Buy-Till-You-Die models, we first need to discuss the basic underlying theory and model.
In this model, we assume a customer becomes ‘alive’ to the business at the first purchase, makes purchases stochastically but at a steady rate for a period of time, and then ‘dies’ - i.e. becomes inactive to the business - hence the name “Buy-Till-You-Die”.
Thus, at a high level these models decompose into modelling the transaction events using distributions such as the Poisson or Negative Binomial, and then modelling the ‘dropout’ process using some other method.
A number of BTYD models exist; for this workshop we will focus on the BG/NBD model - the Beta-Geometric/Negative Binomial Distribution model - though we will also discuss the P/NBD (Pareto/Negative Binomial Distribution) model.
These models require only two pieces of information about each customer’s purchasing history: the “recency” (when the last transaction occurred) and “frequency” (the count of transactions made by that customer in a specified time period).
The notation used to represent this information is
\[ X = (x, \, t_x, \, T), \] where
\[ \begin{eqnarray*} x &=& \text{the number of repeat transactions (after the first)}, \\ t_x &=& \text{the time of the last transaction}, \\ T &=& \text{the length of the observed time period}. \end{eqnarray*} \]
From this summary data we can fit most BTYD models.
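As a concrete illustration, the summary triple can be computed from raw transaction times with simple aggregation; here is a minimal sketch in base R (the data values are hypothetical):

```r
# Hypothetical raw data: transaction times (weeks since first purchase) for one customer
tnx_times <- c(0, 3.2, 7.5, 12.1)   # first purchase at time 0
T_obs     <- 20                     # length of the observation period in weeks

x   <- length(tnx_times) - 1        # repeat transactions after the first
t_x <- max(tnx_times)               # time of the last transaction
c(x = x, t_x = t_x, T = T_obs)
```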
2 BTYD Models
There are a number of different statistical approaches to building BTYD models, each relying on different assumptions about how the recency, frequency and monetary values are modelled. We now discuss some of these approaches.
2.1 Pareto/Negative-Binomial Distribution (P/NBD) Model
The P/NBD model relies on five assumptions:
- While active, the number of transactions made by a customer follows a Poisson process with transaction rate \(\lambda\).
- Heterogeneity in \(\lambda\) follows a Gamma distribution \(\Gamma(\lambda \, | \, r, \alpha)\) with shape \(r\) and rate \(\alpha\).
- Each customer has an unobserved ‘lifetime’ of length \(\tau\). The point at which the customer becomes inactive is exponentially distributed with dropout rate \(\mu\).
- Heterogeneity in dropout rates across customers follows a Gamma distribution \(\Gamma(\mu \, | \, s, \beta)\) with shape parameter \(s\) and rate parameter \(\beta\).
- The transaction rate \(\lambda\) and the dropout rate \(\mu\) vary independently across customers.
As before, we express this in mathematical notation as:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(r, \alpha), \\ \mu &\sim& \Gamma(s, \beta), \\ \tau &\sim& \text{Exponential}(\mu) \end{eqnarray*} \]
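To make these assumptions concrete, we can simulate customer-level parameters directly from this hierarchy; a minimal sketch in R (the population parameter values are illustrative, not fitted):

```r
set.seed(42)

n     <- 1000
r     <- 0.5; alpha <- 4    # Gamma prior on the transaction rate lambda
s     <- 0.6; beta  <- 10   # Gamma prior on the dropout rate mu

lambda <- rgamma(n, shape = r, rate = alpha)  # per-customer purchase rates
mu     <- rgamma(n, shape = s, rate = beta)   # per-customer dropout rates
tau    <- rexp(n, rate = mu)                  # unobserved customer lifetimes

summary(tau)
```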
2.2 Beta-Geometric/Negative-Binomial Distribution (BG/NBD) Model
This model relies on a number of base assumptions, somewhat similar to the P/NBD model but modelling lifetime with a different process:
- While active, the number of transactions made by a customer follows a Poisson process with transaction rate \(\lambda\).
- Heterogeneity in \(\lambda\) follows a Gamma distribution \(\Gamma(\lambda \, | \, r, \alpha)\) with shape \(r\) and rate \(\alpha\).
- After any transaction, a customer becomes inactive with probability \(p\).
- Heterogeneity in \(p\) follows a Beta distribution \(B(p \, | \, a, b)\) with shape parameters \(a\) and \(b\).
- The transaction rate \(\lambda\) and the dropout probability \(p\) vary independently across customers.
Note that it follows from the above assumptions that the number of transactions a customer makes before becoming inactive follows a Geometric distribution with parameter \(p\) - hence the ‘Beta-Geometric’ in the name.
To put this into more formal mathematical notation, we have:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(r, \alpha), \\ p &\sim& \text{Beta}(a, b), \\ P(\text{inactive after transaction } k) &=& p \, (1 - p)^{k - 1} \end{eqnarray*} \]
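Under these assumptions, a customer with dropout probability \(p\) is still active after \(k\) transactions with probability \((1 - p)^k\); a short sketch of this survival curve (with an illustrative value of \(p\)):

```r
p <- 0.2            # illustrative dropout probability after each transaction
k <- 0:10

p_alive_after_k <- (1 - p)^k   # probability of surviving k post-first transactions
round(p_alive_after_k, 3)
```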
3 Initial P/NBD Models
We start by modelling the P/NBD model using our synthetic datasets before we try to model real-life data.
Show code
use_fit_start_date <- as.Date("2020-01-01")
use_fit_end_date <- as.Date("2022-01-01")
use_valid_start_date <- as.Date("2022-01-01")
use_valid_end_date <- as.Date("2023-01-01")
3.1 Load Short Time-frame Synthetic Data
We now want to load the short time-frame synthetic data.
Show code
customer_cohortdata_tbl <- read_rds("data/synthdata_shortframe_cohort_tbl.rds")
customer_cohortdata_tbl |> glimpse()
Rows: 50,000
Columns: 4
$ customer_id <chr> "SFC202001_0001", "SFC202001_0002", "SFC202001_0003", "…
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", …
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01", …
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-0…
Show code
customer_simparams_tbl <- read_rds("data/synthdata_shortframe_simparams_tbl.rds")
customer_simparams_tbl |> glimpse()
Rows: 50,000
Columns: 9
$ customer_id <chr> "SFC202001_0001", "SFC202001_0002", "SFC202001_0003", …
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1",…
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01",…
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-…
$ customer_lambda <dbl> 0.12213805, 0.29987747, 0.31504009, 0.03856001, 0.1881…
$ customer_mu <dbl> 0.127118566, 0.096184402, 0.052334526, 0.204708842, 0.…
$ customer_tau <dbl> 29.5956480, 10.9437199, 1.6938450, 2.6798108, 48.75206…
$ customer_amtmn <dbl> 46.2371662, 111.0425353, 45.8870891, 43.0754249, 10.93…
$ customer_amtcv <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
Show code
customer_transactions_tbl <- read_rds("data/synthdata_shortframe_transactions_tbl.rds")
customer_transactions_tbl |> glimpse()
Rows: 297,815
Columns: 4
$ customer_id <chr> "SFC202001_0028", "SFC202001_0006", "SFC202001_0012", "S…
$ tnx_timestamp <dttm> 2020-01-01 00:59:15, 2020-01-01 01:25:43, 2020-01-01 04…
$ invoice_id <chr> "T20200101-0001", "T20200101-0002", "T20200101-0003", "T…
$ tnx_amount <dbl> 269.07, 124.62, 20.81, 14.76, 16.02, 19.16, 211.91, 43.9…
We reproduce the visualisation of the transaction times we used in previous workbooks.
Show code
plot_tbl <- customer_transactions_tbl |>
group_nest(customer_id, .key = "cust_data") |>
filter(map_int(cust_data, nrow) > 3) |>
slice_sample(n = 30) |>
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))
3.2 Derive the Log-likelihood Function
We now turn our attention to deriving the log-likelihood model for the P/NBD model.
We assume we know that a given customer has made \(x\) transactions after the initial one over an observed period of time \(T\), and we label the times of these transactions \(t_1\), \(t_2\), …, \(t_x\).
To model the likelihood for this observation, we need to consider two possibilities: one for where the customer is still ‘alive’ at \(T\), and one where the customer has ‘died’ by \(T\).
In the first case, the likelihood is the product of the likelihoods of the individual transaction times, multiplied by the probability of the customer still being alive at time \(T\).
Because we are modelling the transaction counts as a Poisson process, the times between events follow an exponential distribution, and so both the transaction times and the lifetime contribute exponential terms to the likelihood.
This gives us:
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ... \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(T - t_x)} \\ &=& \lambda^x e^{-\lambda T} \end{eqnarray*} \]
and we can combine this with the likelihood of the lifetime of the customer \(\tau\) being greater than the observation window \(T\),
\[ P(\tau > T \, | \, \mu) = e^{-\mu T} \]
For the second case, the customer becomes inactive at some time \(\tau\) in the interval \((t_x, T]\), and so the likelihood is
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, \, \tau) &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ... \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(\tau - t_x)} \\ &=& \lambda^x e^{-\lambda \tau} \end{eqnarray*} \]
In both cases we do not need the times of the individual transactions, and all we need are the values \((x, t_x, T)\).
As we cannot observe \(\tau\), we want to remove the conditioning on \(\tau\) by integrating it out.
\[ \begin{eqnarray*} L(\lambda, \mu \, | \, x, t_x, T) &=& L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) \, P(\tau > T \, | \, \mu) \; + \\ & & \; \;\; \; \; \;\; \; \int^T_{t_x} L(\lambda \, | \, t_1, t_2, ..., t_x, \, \tau) \, f(\tau \, | \, \mu) \, d\tau \\ &=& \lambda^x e^{-\lambda T} e^{-\mu T} + \lambda^x \mu \int^T_{t_x} e^{-(\lambda + \mu) \tau} \, d\tau \\ &=& \lambda^x e^{-(\lambda + \mu)T} + \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} - \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) T} \\ &=& \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^{x+1}}{\lambda + \mu} e^{-(\lambda + \mu) T} \end{eqnarray*} \]
In Stan, we work with the log-likelihood rather than the likelihood, so we need to take the log of this expression. This creates a problem, as we have no easy way to calculate \(\log(a + b)\) from \(\log a\) and \(\log b\). As this expression occurs a lot, Stan provides a log_sum_exp() function, which is defined by
\[ \ln (a + b) = \text{LogSumExp}(\ln a, \, \ln b) \]
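We can confirm this identity numerically; below is a minimal base-R sketch of a two-argument log_sum_exp (a hypothetical helper for illustration - Stan supplies its own version):

```r
log_sum_exp <- function(log_a, log_b) {
  m <- pmax(log_a, log_b)                    # factor out the max for numerical stability
  m + log(exp(log_a - m) + exp(log_b - m))
}

a <- 3.5
b <- 0.2
all.equal(log(a + b), log_sum_exp(log(a), log(b)))
```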
We then use this log_sum_exp() function to calculate the Log-Likelihood for the model. Note we use “LogSumExp” here to get around limitations in the renderer.
\[ \begin{eqnarray*} LL(\lambda, \mu \, | \, x, t_x, T) &=& \log \left( \frac{\lambda^x}{\lambda + \mu} \left( \mu \, e^{-(\lambda + \mu) t_x} + \lambda \, e^{-(\lambda + \mu) T} \right) \right) \\ &=& x \log \lambda - \log(\lambda + \mu) \; + \\ & & \;\;\;\;\;\;\; \text{LogSumExp}(\log \mu - (\lambda + \mu) \, t_x, \; \log \lambda - (\lambda + \mu) \, T) \end{eqnarray*} \]
This is the log-likelihood model we want to fit in Stan.
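Before turning to Stan, it can be helpful to sanity-check this expression in R; the following is a hypothetical single-customer transcription (not part of the workbook's helper functions), verified against the direct likelihood with illustrative parameter values:

```r
# Two-argument log-sum-exp helper
log_sum_exp <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Log-likelihood for a single customer
pnbd_loglik <- function(lambda, mu, x, t_x, T_cal) {
  lpm <- lambda + mu
  x * log(lambda) - log(lpm) +
    log_sum_exp(log(mu) - lpm * t_x, log(lambda) - lpm * T_cal)
}

# Direct (non-log) likelihood for comparison
pnbd_lik <- function(lambda, mu, x, t_x, T_cal) {
  lambda^x * mu / (lambda + mu) * exp(-(lambda + mu) * t_x) +
    lambda^(x + 1) / (lambda + mu) * exp(-(lambda + mu) * T_cal)
}

all.equal(
  pnbd_loglik(0.3, 0.05, 4, 20, 52),
  log(pnbd_lik(0.3, 0.05, 4, 20, 52))
)
```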
3.3 Construct Datasets
Having loaded the synthetic data we need to construct a number of datasets of derived values.
Show code
customer_summarystats_tbl <- customer_transactions_tbl |>
calculate_transaction_cbs_data(last_date = use_fit_end_date)
customer_summarystats_tbl |> glimpse()
Rows: 33,381
Columns: 6
$ customer_id <chr> "SFC202001_0001", "SFC202001_0002", "SFC202001_0003", "…
$ first_tnx_date <dttm> 2020-01-01 17:02:14, 2020-01-01 15:19:11, 2020-01-01 0…
$ last_tnx_date <dttm> 2020-05-31 15:28:54, 2020-03-05 04:15:05, 2020-01-08 1…
$ x <dbl> 3, 3, 2, 0, 9, 0, 0, 1, 0, 2, 5, 1, 3, 1, 1, 71, 7, 12,…
$ t_x <dbl> 21.556217, 9.076975, 1.054482, 0.000000, 48.249081, 0.0…
$ T_cal <dbl> 104.3272, 104.3374, 104.3875, 104.3695, 104.3064, 104.4…
As before, we construct a number of subsets of the data for use later on in the modelling.
Show code
shuffle_tbl <- customer_summarystats_tbl |>
slice_sample(prop = 1, replace = FALSE)
id_50 <- shuffle_tbl |> head(50) |> pull(customer_id) |> sort()
id_1000 <- shuffle_tbl |> head(1000) |> pull(customer_id) |> sort()
id_5000 <- shuffle_tbl |> head(5000) |> pull(customer_id) |> sort()
id_10000 <- shuffle_tbl |> head(10000) |> pull(customer_id) |> sort()
We then construct some fit data based on these values.
Show code
fit_1000_data_tbl <- customer_summarystats_tbl |> filter(customer_id %in% id_1000)
fit_1000_data_tbl |> glimpse()
Rows: 1,000
Columns: 6
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-02 04:04:36, 2020-01-03 0…
$ last_tnx_date <dttm> 2020-05-08 12:18:39, 2020-01-22 08:40:44, 2020-03-29 1…
$ x <dbl> 2, 1, 1, 10, 1, 7, 0, 11, 0, 7, 13, 4, 0, 2, 0, 2, 0, 2…
$ t_x <dbl> 18.3072133, 2.8845382, 12.3386007, 40.7424467, 0.491829…
$ T_cal <dbl> 104.3827, 104.2614, 104.1167, 104.1358, 104.1138, 103.9…
Show code
fit_10000_data_tbl <- customer_summarystats_tbl |> filter(customer_id %in% id_10000)
fit_10000_data_tbl |> glimpse()
Rows: 10,000
Columns: 6
$ customer_id <chr> "SFC202001_0004", "SFC202001_0007", "SFC202001_0009", "…
$ first_tnx_date <dttm> 2020-01-01 09:55:23, 2020-01-01 13:54:58, 2020-01-01 1…
$ last_tnx_date <dttm> 2020-01-01 09:55:23, 2020-01-01 13:54:58, 2020-01-01 1…
$ x <dbl> 0, 0, 0, 0, 30, 2, 0, 1, 0, 1, 21, 0, 0, 30, 56, 0, 0, …
$ t_x <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 61.5535…
$ T_cal <dbl> 104.3695, 104.3457, 104.3411, 104.3631, 104.3536, 104.3…
Finally, we also want to recreate our transaction visualisation for a random sample of 50 customers.
Show code
plot_tbl <- customer_transactions_tbl |>
filter(customer_id %in% id_50)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))
3.4 Write Data
Show code
id_1000 |> write_rds("data/shortframe_id_1000.rds")
id_5000 |> write_rds("data/shortframe_id_5000.rds")
id_10000 |> write_rds("data/shortframe_id_10000.rds")
fit_1000_data_tbl |> write_rds("data/fit_1000_shortframe_data_tbl.rds")
fit_10000_data_tbl |> write_rds("data/fit_10000_shortframe_data_tbl.rds")
customer_summarystats_tbl |> write_rds("data/customer_summarystats_shortframe_tbl.rds")
4 Fit Initial P/NBD Model
Show code
syslog(
glue("Creating the first P/NBD model"),
level = "INFO"
)
We now construct our Stan model and prepare to fit it with our synthetic dataset.
Before we start on that, we set a few parameters for the workbook to organise our Stan code.
Show code
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"
We also want to set a number of overall parameters for this workbook.
To start the fitting, we use the 1,000-customer subset. We also need to calculate the summary statistics for the validation period.
Show code
customer_fit_stats_tbl <- fit_1000_data_tbl
customer_fit_stats_tbl |> glimpse()
Rows: 1,000
Columns: 6
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-02 04:04:36, 2020-01-03 0…
$ last_tnx_date <dttm> 2020-05-08 12:18:39, 2020-01-22 08:40:44, 2020-03-29 1…
$ x <dbl> 2, 1, 1, 10, 1, 7, 0, 11, 0, 7, 13, 4, 0, 2, 0, 2, 0, 2…
$ t_x <dbl> 18.3072133, 2.8845382, 12.3386007, 40.7424467, 0.491829…
$ T_cal <dbl> 104.3827, 104.2614, 104.1167, 104.1358, 104.1138, 103.9…
Show code
customer_valid_stats_tbl <- customer_transactions_tbl |>
filter(
customer_id %in% id_1000,
tnx_timestamp > use_valid_start_date
) |>
summarise(
tnx_count = n(),
tnx_last_interval = difftime(
use_valid_end_date,
max(tnx_timestamp),
units = "weeks"
) |>
as.numeric(),
.by = customer_id
)
customer_valid_stats_tbl |> glimpse()
Rows: 189
Columns: 3
$ customer_id <chr> "SFC202105_0281", "SFC202109_0705", "SFC202112_0941"…
$ tnx_count <int> 3, 7, 4, 18, 2, 27, 26, 2, 1, 1, 5, 30, 4, 8, 4, 2, …
$ tnx_last_interval <dbl> 36.1895432, 29.8301773, 13.0282031, 20.9624722, 51.3…
We start with the Stan model.
functions {
#include util_functions.stan
}
data {
int<lower=1> n; // number of customers
vector<lower=0>[n] t_x; // time to most recent purchase
vector<lower=0>[n] T_cal; // total observation time
vector<lower=0>[n] x; // number of purchases observed
real<lower=0> lambda_mn; // prior mean for lambda
real<lower=0> lambda_cv; // prior cv for lambda
real<lower=0> mu_mn; // prior mean for mu
  real<lower=0> mu_cv;      // prior cv for mu
}
transformed data {
real<lower=0> r = 1 / (lambda_cv * lambda_cv);
real<lower=0> alpha = 1 / (lambda_cv * lambda_cv * lambda_mn);
real<lower=0> s = 1 / (mu_cv * mu_cv);
real<lower=0> beta = 1 / (mu_cv * mu_cv * mu_mn);
}
parameters {
vector<lower=0>[n] lambda; // purchase rate
vector<lower=0>[n] mu; // lifetime dropout rate
}
model {
// setting priors
lambda ~ gamma(r, alpha);
mu ~ gamma(s, beta);
target += calculate_pnbd_loglik(n, lambda, mu, x, t_x, T_cal);
}
generated quantities {
vector[n] p_alive; // Probability that they are still "alive"
p_alive = 1 ./ (1 + mu ./ (mu + lambda) .* (exp((lambda + mu) .* (T_cal - t_x)) - 1));
}
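The transformed data block converts the prior mean and coefficient of variation into Gamma shape and rate parameters, using the fact that a Gamma with shape \(k\) and rate \(\theta\) has mean \(k / \theta\) and cv \(1 / \sqrt{k}\). A quick check of this reparameterisation in R (using the prior values we pass in later):

```r
lambda_mn <- 0.25   # prior mean for lambda
lambda_cv <- 1.00   # prior coefficient of variation for lambda

r     <- 1 / (lambda_cv^2)               # shape
alpha <- 1 / (lambda_cv^2 * lambda_mn)   # rate

# Recover the mean and cv of a Gamma(shape = r, rate = alpha)
c(mean = r / alpha, cv = 1 / sqrt(r))
```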
This file contains a few new features of Stan - named file includes and user-defined functions such as calculate_pnbd_loglik. We look at the included functions file here:
real calculate_pnbd_loglik(int n, vector lambda, vector mu,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lpm;
vector[n] lht;
vector[n] rht;
lpm = lambda + mu;
lht = log(lambda) - lpm .* T_cal;
rht = log(mu) - lpm .* t_x;
t1 = x .* log(lambda) - log(lpm);
for (i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t1) + sum(t2));
}
real calculate_bgnbd_loglik(int n, vector lambda, vector p,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lht;
vector[n] rht;
lht = log(p) + (x-1) .* log(1-p) + x .* log(lambda) - lambda .* t_x;
rht = x .* log(1-p) + x .* log(lambda) - lambda .* T_cal;
for(i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t2));
}
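The generated quantities block computes the probability that each customer is still active at the end of the observation window. The same formula can be sketched in R for a single customer (parameter values are illustrative):

```r
# Probability a customer is still active at time T_cal, given their last
# transaction was at t_x - mirroring the generated quantities block above
p_alive <- function(lambda, mu, t_x, T_cal) {
  1 / (1 + mu / (mu + lambda) * (exp((lambda + mu) * (T_cal - t_x)) - 1))
}

# Last purchase long before the end of the window: almost certainly inactive
p_alive(lambda = 0.3, mu = 0.05, t_x = 10, T_cal = 52)

# Purchase right at the end of the window: certainly still active
p_alive(lambda = 0.3, mu = 0.05, t_x = 52, T_cal = 52)
```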
4.1 Compile and Fit Stan Model
We now compile this model using CmdStanR.
Show code
pnbd_fixed_stanmodel <- cmdstan_model(
"stan_code/pnbd_fixed.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then use this compiled model with our data to produce a fit of the data.
Show code
stan_modelname <- "pnbd_init_short_fixed1"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 1.00,
mu_mn = 0.10,
mu_cv = 1.00,
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed1_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)
pnbd_init_short_fixed1_stanfit$save_object(stanfit_object_file)
} else {
pnbd_init_short_fixed1_stanfit <- read_rds(stanfit_object_file)
}
Running MCMC with 4 chains, at most 8 in parallel...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 13.6 seconds.
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 13.8 seconds.
Chain 3 finished in 13.7 seconds.
Chain 4 finished in 13.6 seconds.
All 4 chains finished successfully.
Mean chain execution time: 13.7 seconds.
Total execution time: 13.8 seconds.
Show code
pnbd_init_short_fixed1_stanfit$summary()
# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.57e+4 -1.57e+4 35.8 34.6 -1.58e+4 -1.57e+4 1.01 617.
2 lambda[1] 1.07e-1 8.94e-2 0.0722 0.0614 2.35e-2 2.51e-1 1.00 3545.
3 lambda[2] 2.01e-1 1.60e-1 0.154 0.127 3.08e-2 5.07e-1 1.00 3116.
4 lambda[3] 9.07e-2 7.35e-2 0.0699 0.0570 1.29e-2 2.29e-1 1.00 4801.
5 lambda[4] 2.27e-1 2.21e-1 0.0697 0.0664 1.26e-1 3.51e-1 1.01 4151.
6 lambda[5] 3.01e-1 2.32e-1 0.250 0.193 4.10e-2 8.15e-1 1.00 3406.
7 lambda[6] 5.02e-1 4.80e-1 0.186 0.175 2.45e-1 8.45e-1 1.00 4267.
8 lambda[7] 1.37e-1 8.06e-2 0.167 0.0903 4.60e-3 4.58e-1 1.00 2396.
9 lambda[8] 4.81e-1 4.66e-1 0.147 0.134 2.73e-1 7.47e-1 1.00 5377.
10 lambda[9] 1.50e-1 8.53e-2 0.195 0.0984 4.60e-3 4.96e-1 1.00 3084.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_init_short_fixed1_stanfit$cmdstan_diagnose()
Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed1-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
4.2 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
Show code
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed1_stanfit$draws(inc_warmup = TRUE) |>
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Lambda and Mu Values")
As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
Show code
pnbd_init_short_fixed1_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))
A common MCMC diagnostic is \(\hat{R}\) - a measure of the ‘similarity’ of the chains.
Show code
pnbd_init_short_fixed1_stanfit |>
rhat(pars = c("lambda", "mu")) |>
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")
Related to this quantity is the effective sample size, \(N_{eff}\), an estimate of the number of effectively independent draws in the sample.
Show code
pnbd_init_short_fixed1_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")
Finally, we also want to look at autocorrelation in the chains for each parameter.
Show code
pnbd_init_short_fixed1_stanfit$draws() |>
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")
As before, this first fit has a comprehensive run of fit diagnostics, but for the sake of brevity in later models we will show only the traceplots once we are satisfied with the validity of the sample.
4.3 Check Model Fit
As we are still working with synthetic data, we know the true values for each customer and so we can check how good our model is at recovering the true values on a customer-by-customer basis.
As in previous workbooks, we build our validation datasets and then check the distribution of \(q\)-values for both \(\lambda\) and \(\mu\) across the customer base.
Show code
pnbd_init_short_fixed1_valid_lst <- create_pnbd_posterior_validation_data(
stanfit = pnbd_init_short_fixed1_stanfit,
data_tbl = customer_fit_stats_tbl,
simparams_tbl = customer_simparams_tbl
)
pnbd_init_short_fixed1_valid_lst$lambda_qval_plot |> plot()
Show code
pnbd_init_short_fixed1_valid_lst$mu_qval_plot |> plot()
These plots suggest the model is recovering the parameters well, but we cannot rely on this approach once we move to real data, so we will not use it again.
4.4 Assess Model Fit Using Simulation
Rather than relying on knowing the ‘true’ answer, we will instead use our posterior sample to generate data and compare this simulated data against the data we fitted to. This procedure is similar to what we did before, but now we focus on in-sample data rather than validation data.
Show code
pnbd_init_short_fixed1_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_init_short_fixed1_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed1_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-01 07:41:56, 2020-01-01 0…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.0386528, 0.2917810, 0.0272348, 0.2894580, 0.0330702, …
$ post_mu <dbl> 0.04573740, 0.05848340, 0.03846070, 0.07770960, 0.06496…
$ p_alive <dbl> 1.29145e-03, 4.82767e-13, 5.96495e-03, 8.88993e-14, 3.2…
We then use these posterior statistics as inputs to our simulations to help us assess the in-sample quality of fit.
Show code
fit_label <- "pnbd_init_short_fixed1"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
Show code
pnbd_init_short_fixed1_fitsims_index_tbl <- pnbd_init_short_fixed1_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed1_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed1_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed1_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed1_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_fit_pnbd_init_sho…
We now want to load up the summary statistics for each of our customers for later analysis.
Show code
retrieve_sim_stats <- ~ .x |>
read_rds() |>
select(draw_id, sim_data, sim_tnx_count, sim_tnx_last)
Show code
pnbd_init_short_fixed1_fit_simstats_tbl <- pnbd_init_short_fixed1_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed1_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed1_fit_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[4 x 2]>], [<tbl_df[0 x 2]>], [<tbl_df[1 x 2]>]…
$ sim_tnx_count <int> 4, 0, 1, 6, 6, 2, 3, 0, 3, 1, 0, 0, 2, 6, 0, 0, 7, 0, 10…
$ sim_tnx_last <dttm> 2020-06-03 20:43:26, NA, 2020-02-03 18:25:10, 2020-03-0…
We now use this data to check how well our model fits the data.
4.4.1 Compare Counts of Multitransaction Customers
We start by checking some high-level summary statistics, such as the number of customers with more than one transaction, in both the observed data and the simulated data.
Show code
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
The observed count of customers with at least one additional transaction after the first is captured well by this simulation.
4.4.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
As before, these all look good: our model is doing a good job of capturing the data.
4.4.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed1_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

4.4.4 Write to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed1_fit_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed1_assess_fit_simstats_tbl.rds")

4.5 Assess Out-of-Sample Data
We now repeat this exercise, but for the validation period of 2022.
Show code
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
Show code
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed1_validsims_index_tbl <- pnbd_init_short_fixed1_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed1_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_valid_pnbd_init_s…
We can now run these simulations to check how well our model captures transactions outside the fitted period.
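Conceptually, each posterior draw amounts to simulating a residual lifetime from the exponential dropout process and then a Poisson number of transactions over the active portion of the validation window. Below is a minimal sketch of that logic - our own simplified stand-in, not the workbook's `generate_pnbd_validation_transactions()` - assuming `lambda` (purchase rate) and `mu` (dropout rate) share the same time unit as `t_horizon`, the length of the validation window.

```r
# Minimal sketch of one P/NBD validation draw for a single customer.
# Simplified stand-in for generate_pnbd_validation_transactions():
# the real routine also records transaction times and sizes.
simulate_pnbd_count <- function(lambda, mu, t_horizon, p_alive = 1) {
  if (runif(1) > p_alive) return(0L)   # customer may already have 'died'
  tau      <- rexp(1, rate = mu)       # residual lifetime from dropout process
  t_active <- min(tau, t_horizon)      # time active within the window
  rpois(1, lambda * t_active)          # Poisson transaction count while active
}

set.seed(421)
simulate_pnbd_count(lambda = 0.25, mu = 0.10, t_horizon = 52)
```

Running this once per posterior draw for a customer yields the distribution of simulated counts that is summarised in `sim_tnx_count`.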
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed1_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed1_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed1_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed1/sims_valid_pnbd_init_s…
Now that we have generated our simulations we want to load the data from the files and construct a dataset for use as part of the validation.
Show code
pnbd_init_short_fixed1_valid_simstats_tbl <- pnbd_init_short_fixed1_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed1_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed1_valid_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>]…
$ sim_tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sim_tnx_last <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
4.5.1 Compare Counts of Multitransaction Customers
We start by checking high-level summary statistics, such as the count of multi-transaction customers in both the observed and the simulated data, and the observed total transaction count.
Show code
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

4.5.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

4.5.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed1_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

4.5.4 Write Data to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed1_valid_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed1_assess_valid_simstats_tbl.rds")

4.6 Final Assessment
All of this looks good: our model appears to do a good job of recreating the data both in and out of sample. This is not surprising, however, since we mostly used the same input parameters.
For real-life work we will not have this benefit, so it is worth seeing how sensitive this process is to the prior parameters.
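The prior models in this workbook differ only in the coefficient of variation of the Gamma priors on \(\lambda\) and \(\mu\). To see how much a CV change tightens or widens a prior, it helps to convert the (mean, CV) specification to the usual (shape, rate) parameterisation - presumably what the workbook's `gamma_mucv2shaperate()` helper does; the version below is our own sketch.

```r
# Convert a (mean, CV) spec to Gamma (shape, rate).
# For Gamma(shape = k, rate = r): mean = k / r and CV = 1 / sqrt(k).
mucv_to_shaperate <- function(mu, cv) {
  shape <- 1 / cv^2
  c(shape = shape, rate = shape / mu)
}

mucv_to_shaperate(mu = 0.25, cv = 0.50)   # shape = 4,    rate = 16
mucv_to_shaperate(mu = 0.25, cv = 2.00)   # shape = 0.25, rate = 1
```

A CV of 0.50 gives a concentrated Gamma(4, 16) prior, while a CV of 2.00 gives the much more diffuse Gamma(0.25, 1) - the two alternatives explored below.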
5 Fit Alternate Prior Model
Show code
syslog(
glue("Creating the second P/NBD model (lower CoV)"),
level = "INFO"
)

We want to try an alternate prior model with a smaller coefficient of variation to see what impact it has on our procedures.
Show code
stan_modelname <- "pnbd_init_short_fixed2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 0.50,
mu_mn = 0.10,
mu_cv = 0.50
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed2_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
pnbd_init_short_fixed2_stanfit$save_object(stanfit_object_file)
} else {
pnbd_init_short_fixed2_stanfit <- read_rds(stanfit_object_file)
}

Running MCMC with 4 chains, at most 8 in parallel...
Chain 1 finished in 12.6 seconds.
Chain 2 finished in 12.6 seconds.
Chain 3 finished in 12.5 seconds.
Chain 4 finished in 12.6 seconds.
All 4 chains finished successfully.
Mean chain execution time: 12.6 seconds.
Total execution time: 13.0 seconds.
Show code
pnbd_init_short_fixed2_stanfit$summary()

# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -3.33e+4 -3.33e+4 32.4 32.5 -3.33e+4 -3.32e+4 1.00 671.
2 lambda[1] 1.57e-1 1.45e-1 0.0698 0.0632 6.49e-2 2.82e-1 1.00 3752.
3 lambda[2] 2.30e-1 2.14e-1 0.106 0.0976 9.30e-2 4.33e-1 1.01 4771.
4 lambda[3] 1.54e-1 1.45e-1 0.0699 0.0637 6.03e-2 2.82e-1 1.00 5939.
5 lambda[4] 2.34e-1 2.29e-1 0.0632 0.0590 1.39e-1 3.47e-1 1.01 5037.
6 lambda[5] 2.62e-1 2.40e-1 0.127 0.118 9.68e-2 5.02e-1 1.00 4630.
7 lambda[6] 3.89e-1 3.78e-1 0.118 0.119 2.24e-1 5.98e-1 1.00 5090.
8 lambda[7] 2.10e-1 1.88e-1 0.112 0.105 6.45e-2 4.13e-1 1.00 3785.
9 lambda[8] 4.00e-1 3.92e-1 0.107 0.107 2.46e-1 5.87e-1 1.00 6422.
10 lambda[9] 2.11e-1 1.91e-1 0.116 0.103 6.74e-2 4.26e-1 1.00 4846.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_init_short_fixed2_stanfit$cmdstan_diagnose()

Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-1.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-2.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-3.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed2-4.csv
Warning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
5.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
Show code
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed2_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))

We want to check the \(N_{eff}\) statistics also.
Show code
pnbd_init_short_fixed2_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")5.2 Assess Model Fit Using Simulation
We now want to check our in-sample data.
Show code
pnbd_init_short_fixed2_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_init_short_fixed2_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed2_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-01 07:41:56, 2020-01-01 0…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.1779810, 0.1129470, 0.2258780, 0.2368870, 0.1377210, …
$ post_mu <dbl> 0.0455497, 0.0620248, 0.1229170, 0.1167380, 0.0440958, …
$ p_alive <dbl> 2.16170e-08, 8.12079e-07, 2.59553e-13, 1.82846e-13, 6.5…
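The p_alive column above gives, for each posterior draw, the probability that the customer is still active at the end of the fit period. Conditional on a draw's \((\lambda, \mu)\) pair, the standard P/NBD result is \(P(\text{alive}) = \left[1 + \frac{\mu}{\lambda + \mu}\left(e^{(\lambda + \mu)(T - t_x)} - 1\right)\right]^{-1}\); a sketch of this calculation follows, assuming `construct_pnbd_posterior_statistics()` computes something equivalent.

```r
# P(alive) for one customer under the P/NBD, conditional on a posterior
# draw's lambda (purchase rate) and mu (dropout rate), given recency t_x
# and observation period T_cal. expm1(x) computes exp(x) - 1 stably.
p_alive_pnbd <- function(lambda, mu, t_x, T_cal) {
  1 / (1 + (mu / (lambda + mu)) * expm1((lambda + mu) * (T_cal - t_x)))
}

# A long gap since the last purchase drives p_alive towards zero
p_alive_pnbd(lambda = 0.18, mu = 0.05, t_x = 10, T_cal = 104)
```

Note that a purchase on the final observation day (\(t_x = T\)) gives \(P(\text{alive}) = 1\), matching the intuition that a just-seen customer cannot yet have dropped out.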
Show code
fit_label <- "pnbd_init_short_fixed2"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
Show code
pnbd_init_short_fixed2_fitsims_index_tbl <- pnbd_init_short_fixed2_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed2_fitsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed2_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed2_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed2_fitsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_fit_pnbd_init_sho…
We now want to load up the summary statistics for each of our customers for later analysis.
Show code
pnbd_init_short_fixed2_fit_simstats_tbl <- pnbd_init_short_fixed2_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed2_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed2_fit_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[4 x 2]>], [<tbl_df[0 x 2]>], [<tbl_df[1 x 2]>]…
$ sim_tnx_count <int> 4, 0, 1, 12, 0, 1, 2, 0, 1, 0, 1, 0, 0, 0, 5, 9, 3, 1, 0…
$ sim_tnx_last <dttm> 2020-03-01 01:06:07, NA, 2020-01-17 01:40:33, 2020-06-3…
We now use this data to check how well our model fits the data.
5.2.1 Compare Counts of Multitransaction Customers
We start by checking high-level summary statistics, such as the count of multi-transaction customers in both the observed and the simulated data, and the observed total transaction count.
Show code
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

The observed count of customers with at least one additional transaction after the first is well captured by this simulation.
5.2.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

As before, these all look good: our model is doing a good job of capturing the data.
5.2.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed2_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

5.2.4 Write to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed2_fit_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed2_assess_fit_simstats_tbl.rds")

5.3 Assess Out-of-Sample Data
We now repeat this exercise, but for the validation period of 2022.
Show code
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)

[1] TRUE
Show code
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed2_validsims_index_tbl <- pnbd_init_short_fixed2_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed2_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_valid_pnbd_init_s…
We can now run these simulations to check how well our model captures transactions outside the fitted period.
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed2_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed2_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed2_validsims_index_tbl |> glimpse()

Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed2/sims_valid_pnbd_init_s…
Now that we have generated our simulations we want to load the data from the files and construct a dataset for use as part of the validation.
Show code
pnbd_init_short_fixed2_valid_simstats_tbl <- pnbd_init_short_fixed2_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed2_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed2_valid_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>]…
$ sim_tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sim_tnx_last <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
5.3.1 Compare Counts of Multitransaction Customers
We start by checking high-level summary statistics, such as the count of multi-transaction customers in both the observed and the simulated data, and the observed total transaction count.
Show code
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)

5.3.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)

5.3.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed2_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)

5.3.4 Write to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed2_valid_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed2_assess_valid_simstats_tbl.rds")

5.4 Final Assessment
This model appears to struggle to account properly for the transaction frequency, as the distribution of transaction counts is biased to the low side.
This may be due to the compressed nature of the Gamma priors that results from the smaller coefficient of variation. As a final model, we might try wider priors to see how they perform.
6 Fit Higher-CV Prior Model
Show code
syslog(
glue("Creating the third P/NBD model (higher CoV)"),
level = "INFO"
)

We want to try an alternate prior model with a larger coefficient of variation to see what impact it has on our procedures.
Show code
stan_modelname <- "pnbd_init_short_fixed3"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed <- stanfit_seed + 1
stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")
stan_data_lst <- customer_fit_stats_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 2.00,
mu_mn = 0.10,
mu_cv = 2.00
)
if(!file_exists(stanfit_object_file)) {
pnbd_init_short_fixed3_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = stanfit_seed,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
pnbd_init_short_fixed3_stanfit$save_object(stanfit_object_file)
} else {
pnbd_init_short_fixed3_stanfit <- read_rds(stanfit_object_file)
}

Running MCMC with 4 chains, at most 8 in parallel...
Chain 3 finished in 34.0 seconds.
Chain 4 finished in 33.9 seconds.
Chain 1 finished in 34.8 seconds.
Chain 2 finished in 34.7 seconds.
All 4 chains finished successfully.
Mean chain execution time: 34.3 seconds.
Total execution time: 35.2 seconds.
Show code
pnbd_init_short_fixed3_stanfit$summary()

# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.13e+4 -1.13e+4 40.3 4.11e+1 -1.13e+4 -1.12e+4 1.01 473.
2 lambda[1] 7.82e-2 6.13e-2 0.0646 5.17e-2 1.03e-2 2.08e-1 1.00 1847.
3 lambda[2] 1.80e-1 1.13e-1 0.203 1.23e-1 6.49e-3 5.71e-1 1.00 2187.
4 lambda[3] 4.61e-2 2.63e-2 0.0537 3.00e-2 2.26e-3 1.55e-1 1.00 1575.
5 lambda[4] 2.24e-1 2.15e-1 0.0759 7.41e-2 1.15e-1 3.61e-1 1.00 2664.
6 lambda[5] 4.38e-1 2.73e-1 0.481 2.98e-1 2.17e-2 1.39e+0 1.00 1668.
7 lambda[6] 5.69e-1 5.39e-1 0.224 2.13e-1 2.56e-1 9.88e-1 1.00 3001.
8 lambda[7] 4.70e-2 2.81e-3 0.148 4.16e-3 3.37e-7 2.58e-1 1.00 1130.
9 lambda[8] 5.08e-1 4.94e-1 0.156 1.58e-1 2.83e-1 7.74e-1 1.00 3024.
10 lambda[9] 4.96e-2 2.74e-3 0.167 4.06e-3 2.37e-7 2.36e-1 1.00 1239.
# ℹ 2,991 more rows
# ℹ 1 more variable: ess_tail <num>
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_init_short_fixed3_stanfit$cmdstan_diagnose()

Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-1.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-2.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-3.csv
Warning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_short_fixed3-4.csv
Warning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
The following parameters had split R-hat greater than 1.05:
p_alive[99], p_alive[181], p_alive[239], p_alive[255], p_alive[359], p_alive[531]
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizating your model with additional prior information or a more effective parameterization.
Processing complete.
6.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
Show code
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_init_short_fixed3_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))

We want to check the \(N_{eff}\) statistics also.
Show code
pnbd_init_short_fixed3_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")

6.2 Assess Model Fit Using Simulation
We now want to check our in-sample data.
Show code
pnbd_init_short_fixed3_simstats_tbl <- construct_pnbd_posterior_statistics(
stanfit = pnbd_init_short_fixed3_stanfit,
fitdata_tbl = customer_fit_stats_tbl
)
pnbd_init_short_fixed3_simstats_tbl |> glimpse()

Rows: 2,000,000
Columns: 6
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-01 07:41:56, 2020-01-01 0…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ post_lambda <dbl> 0.0486667, 0.0286213, 0.0277680, 0.1675780, 0.0133441, …
$ post_mu <dbl> 0.10119600, 0.01795630, 0.00192808, 0.00481450, 0.00186…
$ p_alive <dbl> 3.70106e-06, 4.57516e-02, 5.64431e-01, 1.28698e-05, 7.5…
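For reference, the p_alive value for each draw is the P/NBD conditional probability that the customer is still active at the end of the observation period, computed from that draw's \(\lambda\) and \(\mu\) together with the summary \((x, t_x, T)\) - assuming construct_pnbd_posterior_statistics() uses the usual closed form:

\[ P(\text{alive} \mid \lambda, \mu, x, t_x, T) = \left[ 1 + \frac{\mu}{\lambda + \mu} \left( e^{(\lambda + \mu)(T - t_x)} - 1 \right) \right]^{-1}. \]

Note how a long gap since the last transaction (large \(T - t_x\)) drives this probability towards zero.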
Show code
fit_label <- "pnbd_init_short_fixed3"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
Show code
pnbd_init_short_fixed3_fitsims_index_tbl <- pnbd_init_short_fixed3_simstats_tbl |>
mutate(
start_dttm = first_tnx_date |> as.POSIXct(),
end_dttm = use_fit_end_date |> as.POSIXct(),
lambda = post_lambda,
mu = post_mu,
p_alive = 1, ### In-sample validation, so customer begins active
tnx_mu = 100, ### We are not simulating tnx size, so put in defaults
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed3_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_fit_pnbd_init_sho…
We now use this setup to generate our simulations.
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed3_fitsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed3_fitsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed3_fitsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_fit_pnbd_init_sho…
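For intuition about what each simulation draw does, the P/NBD data-generating process is simple to state: the customer's unobserved lifetime is \(\text{Exponential}(\mu)\), and while alive, transactions arrive as a Poisson process with rate \(\lambda\), i.e. with \(\text{Exponential}(\lambda)\) inter-arrival gaps. The helper below is a minimal illustrative sketch of this mechanism - it is not the author's generate_pnbd_validation_transactions(), and the function name is hypothetical:

```r
# Hypothetical sketch of the P/NBD generative process for one posterior draw.
# lambda:  transaction rate while alive
# mu:      dropout rate (lifetime is Exponential(mu))
# horizon: length of the simulation window, in the same time units as the rates
simulate_pnbd_draw <- function(lambda, mu, horizon) {
  death_time <- rexp(1, rate = mu)       # unobserved customer lifetime
  end_time <- min(death_time, horizon)   # transactions stop at death

  tnx_times <- numeric(0)
  t <- rexp(1, rate = lambda)            # time of the first arrival
  while (t < end_time) {
    tnx_times <- c(tnx_times, t)
    t <- t + rexp(1, rate = lambda)      # next exponential gap
  }
  tnx_times
}

set.seed(421)
simulate_pnbd_draw(lambda = 0.05, mu = 0.01, horizon = 365)
```

Repeating this for every posterior draw of \((\lambda, \mu)\) propagates the parameter uncertainty into the simulated transaction counts.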
We now want to load up the summary statistics for each of our customers for later analysis.
Show code
pnbd_init_short_fixed3_fit_simstats_tbl <- pnbd_init_short_fixed3_fitsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed3_fit"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed3_fit_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[0 x 2]>], [<tbl_df[17 x 2]>], [<tbl_df[3 x 2]>…
$ sim_tnx_count <int> 0, 17, 3, 2, 3, 19, 0, 3, 1, 2, 0, 0, 1, 0, 1, 1, 18, 6,…
$ sim_tnx_last <dttm> NA, 2021-11-18 23:16:41, 2020-08-10 19:03:57, 2020-02-1…
We now use these simulations to check how well our model fits the data.
6.2.1 Compare Counts of Multitransaction Customers
We start by checking the high-level summary statistics, such as the count of customers with more than one transaction and the total transaction count, comparing the observed data against the simulations.
Show code
obs_customer_count <- customer_fit_stats_tbl |>
filter(x > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
The observed count of customers with at least one additional transaction after the first is captured by this simulation.
6.2.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_fit_stats_tbl |>
pull(x) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
As before, these all look good. Our model is doing a good job capturing the data.
6.2.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_fit_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed3_fit_simstats_tbl |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)
6.2.4 Write to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed3_fit_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed3_assess_fit_simstats_tbl.rds")
6.3 Assess Out-of-Sample Data
We now repeat this exercise, but for the validation period of 2022.
Show code
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)
[1] TRUE
Show code
use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date |> as.POSIXct()
pnbd_init_short_fixed3_validsims_index_tbl <- pnbd_init_short_fixed3_simstats_tbl |>
mutate(
start_dttm = use_start,
end_dttm = use_final,
lambda = post_lambda,
mu = post_mu,
tnx_mu = 1, ### We are not simulating tnx size
tnx_cv = 1 ###
) |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
)
)
pnbd_init_short_fixed3_validsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_valid_pnbd_init_s…
We can now run these simulations to check how well our model captures transactions outside the fitted period.
Show code
plan(multisession)
precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
as.character() |>
enframe(name = NULL, value = "sim_file")
runsims_tbl <- pnbd_init_short_fixed3_validsims_index_tbl |>
anti_join(precomputed_tbl, by = "sim_file")
if(nrow(runsims_tbl) > 0) {
pnbd_init_short_fixed3_validsims_index_tbl <- runsims_tbl |>
mutate(
chunk_data = future_map2_int(
cust_params, sim_file,
run_simulations_chunk,
sim_func = generate_pnbd_validation_transactions,
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
)
}
pnbd_init_short_fixed3_validsims_index_tbl |> glimpse()
Rows: 1,000
Columns: 3
$ customer_id <chr> "SFC202001_0025", "SFC202001_0084", "SFC202001_0091", "SFC…
$ cust_params <list<tibble[,11]>> [<tbl_df[2000 x 11]>], [<tbl_df[2000 x 11]>]…
$ sim_file <glue> "precompute/pnbd_init_short_fixed3/sims_valid_pnbd_init_s…
Now that we have generated our simulations we want to load the data from the files and construct a dataset for use as part of the validation.
Show code
pnbd_init_short_fixed3_valid_simstats_tbl <- pnbd_init_short_fixed3_validsims_index_tbl |>
mutate(
sim_data = map(
sim_file, retrieve_sim_stats,
.progress = "pnbd_init_short_fixed3_valid"
)
) |>
select(customer_id, sim_data) |>
unnest(sim_data)
pnbd_init_short_fixed3_valid_simstats_tbl |> glimpse()
Rows: 2,000,000
Columns: 5
$ customer_id <chr> "SFC202001_0025", "SFC202001_0025", "SFC202001_0025", "S…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sim_data <list> [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>], [<tbl_df[0 x 2]>]…
$ sim_tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sim_tnx_last <dttm> NA, NA, NA, NA, NA, NA, NA, 2022-04-05 23:32:21, 2022-0…
6.3.1 Compare Counts of Multitransaction Customers
We start by checking the high-level summary statistics, such as the count of customers with more than one transaction and the total transaction count, comparing the observed data against the simulations.
Show code
obs_customer_count <- customer_valid_stats_tbl |>
filter(tnx_count > 0) |>
nrow()
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
count(draw_id, name = "sim_customer_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_customer_count), bins = 50) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Count of Multi-transaction Customers",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Customer Counts",
subtitle = "(observed value in red)"
)
6.3.2 Total Transaction Count
We now check the count of all transactions.
Show code
obs_total_count <- customer_valid_stats_tbl |>
pull(tnx_count) |>
sum()
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
count(draw_id, wt = sim_tnx_count, name = "sim_total_count")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_total_count), bins = 50) +
geom_vline(aes(xintercept = obs_total_count), colour = "red") +
labs(
x = "Count of Total Transactions",
y = "Frequency",
title = "Comparison Plot of Simulated vs Observed Total Counts",
subtitle = "(observed value in red)"
)
6.3.3 Transaction Count Quantiles
We now look at the quantiles for the transaction counts across each customer.
Show code
obs_quantiles_tbl <- customer_valid_stats_tbl |>
reframe(
prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
)
sim_data_tbl <- pnbd_init_short_fixed3_valid_simstats_tbl |>
filter(sim_tnx_count > 0) |>
group_by(draw_id) |>
summarise(
p10 = quantile(sim_tnx_count, 0.10),
p25 = quantile(sim_tnx_count, 0.25),
p50 = quantile(sim_tnx_count, 0.50),
p75 = quantile(sim_tnx_count, 0.75),
p90 = quantile(sim_tnx_count, 0.90),
p99 = quantile(sim_tnx_count, 0.99)
) |>
pivot_longer(
cols = !draw_id,
names_to = "prob_label",
values_to = "sim_prob_values"
) |>
inner_join(obs_quantiles_tbl, by = "prob_label")
ggplot(sim_data_tbl) +
geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
geom_vline(aes(xintercept = prob_value), colour = "red") +
facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
labs(
x = "Quantile of Counts",
y = "Frequency",
title = "Comparison Plots of Transaction Count Quantiles"
)
6.3.4 Write to Disk
We write this data to disk.
Show code
pnbd_init_short_fixed3_valid_simstats_tbl |>
write_rds("data/pnbd_init_short_fixed3_assess_valid_simstats_tbl.rds")
6.4 Final Assessment
The behaviour of this model is a little curious: in-sample the model under-estimates the count of repeat customers, but out-of-sample it is fine.
At the same time, the model over-estimates the total transaction count both in-sample and out-of-sample.
Similarly, the higher quantiles of the transaction counts are over-estimated.
This suggests that the model tends towards higher transaction frequencies but shorter customer lifetimes, which is worth keeping in mind when fitting the model to real data.
7 R Environment
Show code
options(width = 120L)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.2.3 (2023-03-15)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Dublin
date 2023-05-18
pandoc 2.19.2 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.2.0)
backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.2.0)
bayesplot * 1.10.0 2022-11-16 [1] RSPM (R 4.2.0)
bit 4.0.5 2022-11-15 [1] RSPM (R 4.2.0)
bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
boot 1.3-28.1 2022-11-22 [2] CRAN (R 4.2.3)
bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.2.0)
brms * 2.19.0 2023-03-14 [1] RSPM (R 4.2.0)
Brobdingnag 1.2-9 2022-10-19 [1] RSPM (R 4.2.0)
cachem 1.0.7 2023-02-24 [1] RSPM (R 4.2.0)
callr 3.7.3 2022-11-02 [1] RSPM (R 4.2.0)
checkmate 2.1.0 2022-04-21 [1] RSPM (R 4.2.0)
cli 3.6.1 2023-03-23 [1] RSPM (R 4.2.0)
cmdstanr * 0.5.3 2023-05-11 [1] Github (stan-dev/cmdstanr@22b391e)
coda 0.19-4 2020-09-30 [1] RSPM (R 4.2.0)
codetools 0.2-19 2023-02-01 [2] CRAN (R 4.2.3)
colorspace 2.1-0 2023-01-23 [1] RSPM (R 4.2.0)
colourpicker 1.2.0 2022-10-28 [1] RSPM (R 4.2.0)
conflicted * 1.2.0 2023-02-01 [1] RSPM (R 4.2.0)
cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] RSPM (R 4.2.0)
crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
data.table 1.14.8 2023-02-17 [1] RSPM (R 4.2.0)
digest 0.6.31 2022-12-11 [1] RSPM (R 4.2.0)
directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.2.0)
distributional 0.3.2 2023-03-22 [1] RSPM (R 4.2.0)
dplyr * 1.1.1 2023-03-22 [1] RSPM (R 4.2.0)
DT 0.27 2023-01-17 [1] RSPM (R 4.2.0)
dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.2.0)
evaluate 0.20 2023-01-17 [1] RSPM (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] RSPM (R 4.2.0)
farver 2.1.1 2022-07-06 [1] RSPM (R 4.2.0)
fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] RSPM (R 4.2.0)
fs * 1.6.1 2023-02-06 [1] RSPM (R 4.2.0)
furrr * 0.3.1 2022-08-15 [1] RSPM (R 4.2.0)
future * 1.32.0 2023-03-07 [1] RSPM (R 4.2.0)
gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.2.0)
generics 0.1.3 2022-07-05 [1] RSPM (R 4.2.0)
ggdist 3.2.1 2023-01-18 [1] RSPM (R 4.2.0)
ggplot2 * 3.4.2 2023-04-03 [1] RSPM (R 4.2.0)
globals 0.16.2 2022-11-21 [1] RSPM (R 4.2.0)
glue * 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
gtable 0.3.3 2023-03-21 [1] RSPM (R 4.2.0)
gtools 3.9.4 2022-11-27 [1] RSPM (R 4.2.0)
hms 1.1.3 2023-03-21 [1] RSPM (R 4.2.0)
htmltools 0.5.5 2023-03-23 [1] RSPM (R 4.2.0)
htmlwidgets 1.6.2 2023-03-17 [1] RSPM (R 4.2.0)
httpuv 1.6.9 2023-02-14 [1] RSPM (R 4.2.0)
igraph 1.4.2 2023-04-07 [1] RSPM (R 4.2.0)
inline 0.3.19 2021-05-31 [1] RSPM (R 4.2.0)
jsonlite 1.8.4 2022-12-06 [1] RSPM (R 4.2.0)
knitr 1.42 2023-01-25 [1] RSPM (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
later 1.3.0 2021-08-18 [1] RSPM (R 4.2.0)
lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.3)
lifecycle 1.0.3 2022-10-07 [1] RSPM (R 4.2.0)
listenv 0.9.0 2022-12-16 [1] RSPM (R 4.2.0)
lme4 1.1-32 2023-03-14 [1] RSPM (R 4.2.0)
loo 2.6.0 2023-03-31 [1] RSPM (R 4.2.0)
lubridate * 1.9.2 2023-02-10 [1] RSPM (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.2.0)
markdown 1.6 2023-04-07 [1] RSPM (R 4.2.0)
MASS 7.3-58.2 2023-01-23 [2] CRAN (R 4.2.3)
Matrix 1.5-3 2022-11-11 [2] CRAN (R 4.2.3)
matrixStats 0.63.0 2022-11-18 [1] RSPM (R 4.2.0)
memoise 2.0.1 2021-11-26 [1] RSPM (R 4.2.0)
mgcv 1.8-42 2023-03-02 [2] CRAN (R 4.2.3)
mime 0.12 2021-09-28 [1] RSPM (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.2.0)
minqa 1.2.5 2022-10-19 [1] RSPM (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.2.0)
nlme 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
nloptr 2.0.3 2022-05-26 [1] RSPM (R 4.2.0)
parallelly 1.35.0 2023-03-23 [1] RSPM (R 4.2.0)
pillar 1.9.0 2023-03-22 [1] RSPM (R 4.2.0)
pkgbuild 1.4.0 2022-11-27 [1] RSPM (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.2.0)
plyr 1.8.8 2022-11-11 [1] RSPM (R 4.2.0)
posterior * 1.4.1 2023-03-14 [1] RSPM (R 4.2.0)
prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.2.0)
processx 3.8.1 2023-04-18 [1] RSPM (R 4.2.0)
projpred 2.5.0 2023-04-05 [1] RSPM (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.2.0)
ps 1.7.5 2023-04-18 [1] RSPM (R 4.2.0)
purrr * 1.0.1 2023-01-10 [1] RSPM (R 4.2.0)
quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.2.0)
R6 2.5.1 2021-08-19 [1] RSPM (R 4.2.0)
Rcpp * 1.0.10 2023-01-22 [1] RSPM (R 4.2.0)
RcppParallel 5.1.7 2023-02-27 [1] RSPM (R 4.2.0)
readr * 2.1.4 2023-02-10 [1] RSPM (R 4.2.0)
reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.2.0)
rlang * 1.1.0 2023-03-14 [1] RSPM (R 4.2.0)
rmarkdown 2.21 2023-03-26 [1] RSPM (R 4.2.0)
rstan 2.21.8 2023-01-17 [1] RSPM (R 4.2.0)
rstantools 2.3.1 2023-03-30 [1] RSPM (R 4.2.0)
rsyslog * 1.0.2 2021-06-04 [1] RSPM (R 4.2.0)
scales * 1.2.1 2022-08-20 [1] RSPM (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
shiny 1.7.4 2022-12-15 [1] RSPM (R 4.2.0)
shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.2.0)
shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.2.0)
shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.2.0)
StanHeaders 2.21.0-7 2020-12-17 [1] RSPM (R 4.2.0)
stringi 1.7.12 2023-01-11 [1] RSPM (R 4.2.0)
stringr * 1.5.0 2022-12-02 [1] RSPM (R 4.2.0)
svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.2.0)
tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.2.0)
threejs 0.3.3 2020-01-21 [1] RSPM (R 4.2.0)
tibble * 3.2.1 2023-03-20 [1] RSPM (R 4.2.0)
tidybayes * 3.0.4 2023-03-14 [1] RSPM (R 4.2.0)
tidyr * 1.3.0 2023-01-24 [1] RSPM (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.2.0)
tidyverse * 2.0.0 2023-02-22 [1] RSPM (R 4.2.0)
timechange 0.2.0 2023-01-11 [1] RSPM (R 4.2.0)
tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
utf8 1.2.3 2023-01-31 [1] RSPM (R 4.2.0)
vctrs 0.6.2 2023-04-19 [1] RSPM (R 4.2.0)
vroom 1.6.1 2023-01-22 [1] RSPM (R 4.2.0)
withr 2.5.0 2022-03-03 [1] RSPM (R 4.2.0)
xfun 0.38 2023-03-24 [1] RSPM (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] RSPM (R 4.2.0)
xts 0.13.1 2023-04-16 [1] RSPM (R 4.2.0)
yaml 2.3.7 2023-01-23 [1] RSPM (R 4.2.0)
zoo 1.8-12 2023-04-13 [1] RSPM (R 4.2.0)
[1] /usr/local/lib/R/site-library
[2] /usr/local/lib/R/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Show code
options(width = 80L)